Abstract

We show that a straightforward extension of a simple learning model based on the Hebb
rule, the previously introduced Association-Reinforcement-Hebb-Rule, can cope with
"delayed", unspecific reinforcement also in the case of structured data and lead to
perfect generalization.

1 Introduction

Learning from unspecific reinforcement may be essential in various contexts, both natural and
artificial, where, typically, the results of particular actions add up to a final consequence, and
only this final consequence is evaluated. The freedom residing in each step is not (or only
partially) controlled directly, and the learner must improve its performance using only information
concerning the final success of a complex series of actions.

It is therefore important to find out whether there are simple and robust procedures for such
situations, which might have developed under natural conditions and which may also be basic for
artificial learning rules (for this reason we do not consider evolved algorithms like Q-learning [?],
TD learning [?], etc.). In previous works [?, ?] we have introduced an "Association-Reinforcement"
learning model based on the following conception:

1. For each given input (external situation) the agent answers with an action (operation) 
depending solely on the input and on its instantaneous internal (cognitive) structure and si- 
multaneously strengthens (in its internal structure) the blind association between this particular 
input and action. 

2. At the end of a series of actions (path) the final success is judged. Then the associa- 
tions "situation - operation" which have been involved on this path are re-weighted equally and 
depending only on the final success - unspecific reinforcement. 

In [?] we studied an implementation of this model for a classification problem for perceptrons,
the Association-Reinforcement-Hebb-rule (AR-Hebb-rule), and showed some amazing properties:

a) Despite the fact that feedback on the learner's performance enters its learning dynamics 
only in an unspecific way in that it cannot be associated with single identifiable correct or 
incorrect associations, convergence of the AR-Hebb-algorithm in the sense of asymptotically 
perfect generalization is found. 

b) For given initial conditions, this convergence depends on the learning parameters charac- 
terizing the 2 steps described above; in particular none of these steps can be completely inhibited. 
Alternatively, for given algorithm parameters convergence may depend on initial conditions. 

In detail the dynamics of this algorithm was found to be very complex and interesting, 
being controlled by fixed points in the pre-asymptotic regime, and having a continuous set of 
asymptotic convergence laws. These results could easily be extended to the more realistic case 
where in the second step the unspecific reinforcement is randomly applied to only part of the 
associations achieved in the first step (the agent does not recall everything it has done on the 
trial) [?]. Further interesting extensions concern the question of structured data and of
multi-layer perceptrons.

Structured data represent a more involved classification problem and it is known that when 
teacher and data vector are not fully aligned (or exactly uncorrelated) the usual Hebb rule does 
not lead to convergence of the student vector onto that of the teacher, while the perceptron 
algorithm does [?]. On the other hand, the limiting case of the AR-Hebb-rule corresponding to
the perceptron rule has been shown not to converge in the case of unspecific reinforcement for 
non-structured data. It is therefore a non-trivial question whether the unspecific reinforcement 
problem can be solved for structured data and in particular, whether some immediate extension 
of the AR-Hebb-rule can be shown to converge in this case. It is this question which we shall 
address in this paper. In a future publication we shall treat the problem of the committee 
machine as a first step to multi-layer perceptrons. 

In section 2 we shall describe the learning model in the general setting and in section 3 we 
shall discuss its convergence properties providing numerical and analytical results. Thereby we 
shall briefly recall the non-structured data case and then concentrate on the general, structured 
data case. Section 4 contains the conclusions. 



2 Learning rule for perceptrons under unspecific reinforcement 

We consider one-layer perceptrons with Ising or real-number units $s_i$, real weights (synapses) $J_i$
and one Ising output unit:

$$ s = \mathrm{sign}\Big( \frac{1}{\sqrt{N}} \sum_{i=1}^{N} J_i s_i \Big) . \qquad (1) $$

Here $N$ is the number of input nodes, and we put no explicit thresholds. The network (student)
is presented with a series of patterns $s_i = \xi_i^{(q,l)}$, $q = 1, \dots, Q$, $l = 1, \dots, L$, to which it answers with
$s^{(q,l)}$. A training period consists of the successive presentation of $L$ patterns. The answers are
compared with the corresponding answers $t^{(q,l)}$ of a teacher with pre-given weights $B_i$, and the
average error made by the student over one training period is calculated:

$$ e_q = \frac{1}{2L} \sum_{l=1}^{L} \big| t^{(q,l)} - s^{(q,l)} \big| . \qquad (2) $$

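The training algorithm consists of two parts:

I. - a "blind" Hebb-type association at each presentation of a pattern:

$$ J_i^{(q,l+1)} = J_i^{(q,l)} + \frac{a_1}{\sqrt{N}}\, s^{(q,l)} \xi_i^{(q,l)} , \qquad (3) $$

II. - an "unspecific" but graded reinforcement proportional to the average error $e_q$ introduced
in (2), also Hebbian, at the end of each training period,

$$ J_i^{(q+1,1)} = J_i^{(q,L+1)} - e_q \frac{a_2}{\sqrt{N}} \sum_{l=1}^{L} r_l\, s^{(q,l)} \xi_i^{(q,l)} , \qquad (4) $$

where $e_q$ is the average error, eq. (2), and $r_l$ is a dichotomic random variable:

$$ r_l = \begin{cases} 1 & \text{with probability } w , \\ 0 & \text{with probability } 1 - w . \end{cases} \qquad (5) $$

Because of these 2 steps we called this algorithm "association/reinforcement (AR)-Hebb-rule".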
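
The two steps can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `ar_hebb_period`, the tie-breaking in `sign` and the $1/\sqrt{N}$ step normalization are choices made here, and `w` is the recall probability of the reinforcement step.

```python
import math
import random

def sign(x):
    # Ising output; a zero field is mapped to +1 (arbitrary choice for this sketch).
    return 1 if x >= 0 else -1

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def ar_hebb_period(J, B, patterns, a1, a2, w=1.0, rng=random):
    """One training period: L blind Hebb steps, then one graded reinforcement.

    J is updated in place; B is the teacher; patterns is a list of L inputs.
    Returns the average error e_q of the period.
    """
    n = len(J)
    scale = 1.0 / math.sqrt(n)            # assumed step normalization
    path = []                             # associations (s, xi) made on this path
    errors = 0
    for xi in patterns:
        s = sign(dot(J, xi))              # answer from the instantaneous weights
        t = sign(dot(B, xi))              # teacher answer, used only through e_q
        errors += (s != t)
        for i in range(n):                # step I: blind association
            J[i] += a1 * scale * s * xi[i]
        path.append((s, xi))
    e_q = errors / len(patterns)          # equals (1/2L) * sum |t - s|
    for s, xi in path:                    # step II: unspecific reinforcement
        if rng.random() < w:              # association recalled with probability w
            for i in range(n):
                J[i] -= e_q * a2 * scale * s * xi[i]
    return e_q
```

Note that the teacher enters only through the scalar `e_q`; the student never learns which of the `L` answers were wrong.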
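We are interested in the behavior of the generalization error $\epsilon_g(q)$ with the number of
iterations $q$:

$$ \epsilon_g(q) = \frac{1}{\pi} \arccos\Big( \frac{J^{(q)} \cdot B}{|J^{(q)}|\,|B|} \Big) , \qquad (6) $$

in particular we shall test whether the behavior of $\epsilon_g(q)$ follows a power law at large $q$:

$$ \epsilon_g(q) \simeq \mathrm{const}\; q^{-p} . \qquad (7) $$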
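
For readers who want to reproduce such a test numerically, the following sketch computes the generalization error from the student and teacher vectors and estimates the exponent of a power law by a least-squares fit in log-log coordinates. The helper names `generalization_error` and `fit_power_law` are hypothetical, and the fit is a generic procedure, not necessarily the one used for the figures.

```python
import math

def generalization_error(J, B):
    """eps_g = arccos(J.B / (|J||B|)) / pi."""
    jb = sum(a * b for a, b in zip(J, B))
    nj = math.sqrt(sum(a * a for a in J))
    nb = math.sqrt(sum(b * b for b in B))
    c = max(-1.0, min(1.0, jb / (nj * nb)))   # guard against rounding outside [-1, 1]
    return math.acos(c) / math.pi

def fit_power_law(qs, eps):
    """Return p such that eps ~ const * q**(-p), from a log-log least-squares line."""
    xs = [math.log(q) for q in qs]
    ys = [math.log(e) for e in eps]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return -sxy / sxx
```

In practice only the large-$q$ tail of a simulated curve should be passed to `fit_power_law`, since the pre-asymptotic regime does not follow a single power.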
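The training patterns $\{\xi_i^{(q,l)}\}$ are generated randomly from the following distribution:

$$ P(\xi) = \frac{1}{2} \sum_{\sigma=\pm 1} P(\xi|\sigma) , \qquad
   P(\xi|\sigma) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\xi_i - \frac{m}{\sqrt{N}} \sigma C_i\right)^2} , \qquad (8) $$

and we take:

$$ B^2 = C^2 = N , \qquad C \cdot B = \eta N \qquad (9) $$

with fixed, given $m$, $\eta$. Notice the following features: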
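
A minimal sketch of how such structured patterns could be generated is given below. Several things here are assumptions made for illustration: the particular construction of the teacher and anisotropy vectors (any pair with the required norms and overlap would do), the function names, and the scaling of the cluster centres as $\pm m C/\sqrt{N}$, chosen so that the field along $C$ has a mean of order $m$, as in the expectation values of the Appendix.

```python
import math
import random

def make_teacher_and_anisotropy(n, eta):
    """Return (B, C) with B.B = C.C = n and C.B = eta * n; n must be even."""
    if n % 2:
        raise ValueError("this construction needs an even n")
    B = [1.0] * n
    v = [1.0 if i % 2 == 0 else -1.0 for i in range(n)]   # v.B = 0 and v.v = n
    r = math.sqrt(1.0 - eta * eta)
    C = [eta * b + r * vi for b, vi in zip(B, v)]
    return B, C

def sample_pattern(C, m, rng=random):
    """Draw one pattern from the two-cluster Gaussian mixture."""
    n = len(C)
    sigma = rng.choice((-1, 1))                 # cluster label, equal weights
    shift = m * sigma / math.sqrt(n)            # assumed centre scaling
    return [rng.gauss(shift * c, 1.0) for c in C]
```

For $m = 0$ the two clusters coincide and the patterns are isotropic, which is the non-structured case.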

a) During training the student only uses its own associations $\xi^{(q,l)} \leftrightarrow s^{(q,l)}$ and the average
error $e_q$, which does not refer specifically to the particular steps I.

b) Since the answers $s^{(q,l)}$ are made on the basis of the instantaneous weight values $J^{(q,l)}$,
which change at each step according to eq. (3), the series of answers forms a correlated
sequence with each step depending on the previous one. Therefore $e_q$ measures in fact the
performance of a "path", an interdependent set of decisions.

c) In contrast with the case studied in [?], the patterns can now have a structure. This
introduces essential differences to the previous situation, as we shall see in the next section.

d) We explicitly account for imperfect recall at the reinforcement step by the parameter $w$.
This introduces a supplementary, biologically motivated randomness which, however, as already
suggested in [?], does not appear to introduce qualitative changes in the results
(see section 3).

e) For $L = 1$ (and $w = 1$) the algorithm reduces to the usual "perceptron rule" (for $a_1 = 0$)
or to the usual "unsupervised Hebb rule" (for $a_2 = 2a_1$) for on-line learning, for which the
corresponding asymptotic behavior is known [?, ?, ?].
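
The two limits in item e) can be checked by enumerating the four possible answer pairs. The sketch assumes that for $L = 1$, $w = 1$ the association and the reinforcement combine into a single step with coefficient $(a_1 - a_2 e)\,s$ multiplying the pattern, where $e = |t - s|/2$; the function name is hypothetical.

```python
def ar_update_coefficient(s, t, a1, a2):
    """Net coefficient of the pattern in one combined step for L = 1, w = 1."""
    e = abs(t - s) / 2              # 0 if the answer is right, 1 if it is wrong
    return (a1 - a2 * e) * s

for s in (-1, 1):
    for t in (-1, 1):
        # a1 = 0: no change on a correct answer, a step of a2 * t on an error
        assert ar_update_coefficient(s, t, a1=0.0, a2=1.0) == (t if s != t else 0.0)
        # a2 = 2 * a1: the step is a1 * t whatever the student answered
        assert ar_update_coefficient(s, t, a1=1.0, a2=2.0) == t
```

The first case is the error-driven perceptron step; in the second the student's own answer drops out of the update, which is the Hebb-rule limit named in item e).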

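To study the learning behaviour we use Monte Carlo simulation and coarse grained analysis.
The latter is provided by combining the blind association (3) during a learning period of $L$
elementary steps and the graded unspecific reinforcement (4) at the end of each learning period
into one coarse grained step

$$ J_i^{(q+1,1)} = J_i^{(q,1)} + \frac{1}{\sqrt{N}} \sum_{l=1}^{L} \big( a_1 - a_2\, r_l\, e_q \big)\, s^{(q,l)} \xi_i^{(q,l)} , \qquad (10) $$

$$ e_q = \frac{1}{2L} \sum_{l=1}^{L} \Big| \mathrm{sign}\Big( \frac{1}{\sqrt{N}} J^{(q,l)} \cdot \xi^{(q,l)} \Big) - \mathrm{sign}\Big( \frac{1}{\sqrt{N}} B \cdot \xi^{(q,l)} \Big) \Big| . \qquad (11) $$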

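For simplicity we shall take for the time being $r_l = 1$, i.e. $w = 1$ in eq. (5). We introduce

$$ \alpha = qL/N , \qquad \lambda = a_1/a_2 \qquad (12) $$

and rescale everything with $a_2$, which means that we can take without loss of generality $a_2 = 1$
in (10), (11). We define the overlaps:

$$ \mathcal{R}(\alpha) = \frac{1}{N} B \cdot J^{(q,l)} , \qquad Q(\alpha) = \frac{1}{N} \big[ J^{(q,l)} \big]^2 , \qquad \mathcal{D}(\alpha) = \frac{1}{N} C \cdot J^{(q,l)} . \qquad (13) $$

Note that in the "thermodynamic limit" $L/N \to 0$ the overlaps are self-averaging and we can
neglect the dependence of $\mathcal{R}$, $\mathcal{D}$ and $Q$ on $l$. We shall follow standard procedures [?],
[10]. Treating $\alpha$ as a continuous variable and using:

$$ \pi \epsilon_g = x = \arccos \frac{\mathcal{R}}{\sqrt{Q}} , \qquad (14) $$
$$ y = \arccos \frac{\mathcal{D}}{\sqrt{Q}} , \qquad (15) $$
$$ z = \arccos \eta , \qquad (16) $$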
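we obtain the coarse grained equations:

$$ \frac{d\mathcal{R}}{d\alpha} = \frac{d}{d\alpha}\big( \sqrt{Q} \cos x \big) = \Big( \lambda - \frac{1}{2} \Big) A_{JT} + \frac{1}{2L} A_{TT} + \frac{1}{2} \Big( 1 - \frac{1}{L} \Big) S_{JT} A_{JT} , \qquad (17) $$

$$ \frac{d\mathcal{D}}{d\alpha} = \frac{d}{d\alpha}\big( \sqrt{Q} \cos y \big) = \Big( \lambda - \frac{1}{2} \Big) A_{JC} + \frac{1}{2L} A_{TC} + \frac{1}{2} \Big( 1 - \frac{1}{L} \Big) S_{JT} A_{JC} , \qquad (18) $$

$$ \frac{d\sqrt{Q}}{d\alpha} = \Big( \lambda - \frac{1}{2} \Big) A_{JJ} + \frac{1}{2L} A_{TJ} + \frac{1}{2} \Big( 1 - \frac{1}{L} \Big) S_{JT} A_{JJ}
   + \frac{1}{2\sqrt{Q}} \Big[ \Big( \lambda - \frac{1}{2} \Big)^2 + \Big( \lambda - \frac{1}{2} \Big) S_{JT} + \frac{1}{4} \Big( 1 - \frac{1}{L} \Big) S_{JT}^2 + \frac{1}{4L} \Big] , \qquad (19) $$

where the expectation values $A_{\cdot\cdot}$, $S_{JT}$ are given in the Appendix (section 5.1). These equations describe
the flow of the three quantities $\epsilon_g = x/\pi$, $Q$ and $y$ with $\alpha$ and involve the data/teacher parameters
$m$ and $z = \arccos \eta$ and the learning parameter $\lambda$. Note the geometric constraint:

$$ \sin\frac{y - z}{2}\, \sin\frac{y + z}{2} = \omega\, \sin\frac{x}{2} , \qquad |\omega| \leq 1 . \qquad (20) $$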

3 Convergence behaviour of the AR-Hebb algorithm 
3.1 Non-structured data 

The case of non-structured data - $m = 0$ in eqs. (35)-(43) - has been treated in [?]; here we
briefly recall some of the results for later comparison with the structured data case.

Monte Carlo simulations indicate that in spite of the partial information contained in the
unspecific reinforcement perfect generalization is achieved by the AR-Hebb algorithm, depending
on the learning parameters - see [?]. This intriguing behaviour is elucidated by the
coarse grained analysis. In this case eqs. (17)-(19) reduce to two equations (for $\mathcal{R}$ and $Q$) which
have as general asymptotic solutions

$$ \epsilon_g^2 \simeq \frac{1}{2\pi}\, \frac{\lambda L}{1 - \lambda L}\, \alpha^{-1} + c_1\, \alpha^{-\frac{1}{\lambda L}} \qquad \text{for } \lambda \neq \frac{1}{L} , \qquad (21) $$

$$ \epsilon_g^2 \simeq \Big( \frac{1}{2\pi} \ln \alpha + c_2 \Big)\, \alpha^{-1} \qquad \text{for } \lambda = \frac{1}{L} , \qquad (22) $$

$$ Q \simeq \frac{2}{\pi}\, \lambda^2 \alpha^2 \qquad (23) $$

at large $\alpha$. We see that for $\lambda < 1/L$ we obtain asymptotically perfect generalization, the dominant
term exhibiting the usual power $-1/2$, while for $\lambda > 1/L$ the second term in (21) dominates and
ensures again perfect generalization but with a different power law, $-1/(2\lambda L)$. For $\lambda = 1/L$ we
obtain logarithmic corrections - see eq. (22). Notice that these results hold also for $L = 1$. One
can generally see that for $\lambda = 0$ one cannot have perfect generalization for $L > 1$. For $L = 1$ one
re-obtains the asymptotic behavior found in [?].
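
The competition between the two power laws can be summarized in a one-line helper. This is a sketch based only on the exponents quoted above ($-1/2$ and $-1/(2\lambda L)$): asymptotically the more slowly decaying term wins, so the crossover sits where the two exponents coincide, at $\lambda L = 1$. The function name is hypothetical, and the logarithmic correction at the crossover itself is ignored.

```python
def asymptotic_exponent(lam, L):
    """Exponent p in eps_g ~ alpha**(-p) for non-structured data and fixed lam > 0."""
    if lam <= 0:
        raise ValueError("lam = 0 does not give perfect generalization for L > 1")
    # the term with the smaller exponent decays more slowly and dominates
    return min(0.5, 1.0 / (2.0 * lam * L))
```

For example, with $L = 10$ a value $\lambda = 0.05$ gives the usual exponent $1/2$, whereas $\lambda = 0.5$ gives the slower decay with exponent $1/10$.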

This learning algorithm is further characterized by highly interesting pre-asymptotics,
dominated by two stationarity conditions, one for the self-overlap, $dQ/d\alpha = 0$, and one for the
generalization error, $d\epsilon_g/d\alpha = 0$. For suitable values of the network parameters, the two
stationarity conditions may simultaneously be satisfied, leading to fixed points of the learning
dynamics, one of them stable and of poor generalization, the other with one attractive and one
repulsive direction. Correspondingly, the flow is divided by a separatrix defined by a critical
$\lambda_c(Q_0)$ into trajectories leading to convergence according to the asymptotic behaviour (21)-(23)
(for $\lambda > \lambda_c(Q_0)$) or to poor generalization otherwise.

The salient features of these results for the case of non-structured data are the convergence
of the AR-Hebb-algorithm in the sense of asymptotically perfect generalization with a power law
depending on the learning parameters $L$ and $\lambda$, and the existence of a minimal value $\lambda_c(Q_0)$,
fixed by the pre-asymptotic structure, below which the system is driven toward complete
confusion. Notice also that the best convergence is achieved for $\lambda$ just above $\lambda_c$. One last point
concerns the recall parameter $w$, eqs. (4), (5). A rough first quantitative characterization
of this modification would be that it leads to an effective rescaling of the parameter $\lambda$, viz.
$\lambda \to \lambda/w$, leading to a corresponding reduction of the critical $\lambda$'s by approximately a factor $w$. This
is well supported by numerical simulations (see also Fig. 1 for the case of structured data) and
we conclude that the algorithm is stable against this supplementary element of indeterminism.

3.2 Structured data 

Numerical simulations indicate that for $m \neq 0$ and $0 < |\eta| < 1$ the behaviour of the algorithm
for all $w$ is more involved: generically, no convergence is found in this case for fixed values of
the learning parameters. This agrees with the expectations since, on the one hand, the situation
found at $L = 1$, $w = 1$ for structured data [?] could be expected to hold for every $L$, namely
that Hebb updating leads to a nonzero asymptotic generalization error. On the other hand, the
situation found before for non-structured data should hold also for structured data, namely that
the perceptron rule ($\lambda = 0$) (which for $L = 1$ was shown to lead to convergence also in the
structured data case [?]) does not work for $L > 1$.

In fact one can make a more general argument that for fixed $\lambda$ the AR-Hebb rule does not lead
to perfect generalization for generically structured data. To obtain good generalization requires
$\mathcal{R}/\sqrt{Q} \to 1$ and $\mathcal{D}/\sqrt{Q} \to \eta$, from which one may obtain the necessary dominant scaling (with
$Q$) of the various integrals appearing in (35)-(43), namely



AjT 

Atj 
Ajj 
Ajc 
Arc 

A.XT 
SjT 



K 



K 



K 
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with 




(24) 

Figure 1: Generalization error $\epsilon_g$ vs. $q = \alpha N/L$ for $L = 10$, $N = 100$, various overlaps $\eta$ and
starting point $\sqrt{Q(0)} = 100$. We use either patterns with $\xi_i$ real numbers or $\xi_i$ Ising spins
(I) and $\lambda = \lambda_0/\sqrt{\alpha}$ or $\lambda = \lambda_0\, e(\alpha)/e(1)$, see (34). $w$ is the recall probability, eqs. (4), (5). Note the
change in behaviour between $\lambda_0 = 4$ and $\lambda_0 = 6$. The straight lines are illustrative power laws.



This in turn would lead to the following asymptotic expressions for the flow equations ([l7|)-(p!9[) 
(at fixed A) 

K A, 
K A, 

The solution at large a would be ~ k A q + i?o ^-i^d D ~ k A a + Dq, while Q is asymptotically 
given through the implicit equation 

VQc^lKXa + - ln{VQK + X) + IkXko ■ (25) 
2 k 2 

Here Rq, Dq, and kq are integration constants. Hence, asymptotically, y/Q ~ ^kAo which is 



dn 

da 
dV 

da 
dQ 



incompatible with the requirement of good generalization, $\mathcal{R}/\sqrt{Q} \to 1$. Thus the algorithm will
not converge if $\lambda$ is kept fixed.

The question arises, however, whether a simple extension of the algorithm may not overcome
the Odyssean dilemma hinted at in the beginning of this section. We hence suggest tuning the
parameter $\lambda$ such that it is large enough at small $\alpha$ to overcome the pre-asymptotic conditions
and tends to zero at large $\alpha$ in order to approach asymptotically the perceptron rule. As can
be seen in Fig. 1, this procedure seems successful.
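
Two such tuning schedules, the ones quoted in the caption of Fig. 1, can be sketched as follows. The names `lambda_power` and `RunningErrorSchedule` are hypothetical. The error-based schedule is written in terms of the running mean of the observed period errors $e_q$; any constant normalization of $e(\alpha)$ cancels in the ratio $e(\alpha)/e(1)$. Holding $\lambda = \lambda_0$ until $\alpha$ reaches 1 is an assumption of this sketch.

```python
import math

def lambda_power(alpha, lam0):
    """lambda = lam0 / sqrt(alpha), for alpha > 0."""
    return lam0 / math.sqrt(alpha)

class RunningErrorSchedule:
    """lambda = lam0 * e(alpha) / e(1), with e the running mean of the observed e_q."""

    def __init__(self, lam0, periods_per_unit_alpha):
        self.lam0 = lam0
        self.k = periods_per_unit_alpha   # N/L training periods correspond to alpha = 1
        self.total = 0.0
        self.count = 0
        self.e_ref = None                 # e(1), frozen once alpha reaches 1

    def update(self, e_q):
        """Record the error of one finished period and return the new lambda."""
        self.total += e_q
        self.count += 1
        if self.count == self.k:
            self.e_ref = self.total / self.count
        return self.value()

    def value(self):
        if self.e_ref is None or self.e_ref == 0.0:
            return self.lam0              # assumed behaviour before alpha = 1
        return self.lam0 * (self.total / self.count) / self.e_ref
```

The second schedule uses only quantities the student observes itself, whereas the first requires counting the presented patterns.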

Since the situation is now much more complicated we shall not try to solve the general
asymptotic problem, as we did in the case of non-structured data, but we shall limit ourselves
to proving that robust solutions exist. For this we start with the following ansatz:

$$ \lambda = \lambda_0\, \alpha^{-p} , \qquad (26) $$
$$ Q = c^2\, \alpha^{2q} , \qquad (27) $$
$$ \epsilon_g = a\, \alpha^{-r} , \qquad (28) $$
$$ \omega = b\, \alpha^{-s} , \qquad (29) $$

with $\omega$ defined via (20). The asymptotic equations obtained from the flow equations (17)-(19)
assuming $p \sim q \sim r > s \geq 0$ are of the form:

^ A,,X + ^ + A2e,, (30) 
da yjQ 



2/QsinJe3^ ~ 5ii A + + ^2 £3, (31) 
2 " da VQ 



^ ^ C0A + C163, (32) 
da 

Here the coefficients $A_\cdot$, $B_\cdot$, $C_\cdot$ are functions of $m$, $z$, $L$ and of $\omega$ (the explicit expressions obtained
by Maple are given in the Appendix, section 5.2).



It is easy to see that an asymptotic solution can exist for:

$$ p = q = r = 1/2 , \qquad s = 0 , \qquad (33) $$

which is therefore compatible with the assumptions used to derive the asymptotic equations
(30)-(32). Then $a$, $b$, $c$ are obtained as functions of $m$, $\eta$, $L$ for given $\lambda_0$, with some restrictions
on the latter (notice that the coefficients $A_\cdot$, $B_\cdot$, $C_\cdot$ depend nonlinearly on $\omega$, hence on $b$). For
illustration, we show in Fig. 2 the values of $a$, $b$ and $c$ as functions of $\lambda_0$ for $L = 10$, $m = 1$ and
two values of the data-teacher overlap $\eta$. Notice that there is no asymptotic solution for $\lambda_0$ below
$\simeq 0.2$.

In Fig. 3 we show the solution of the full equations (17)-(19) - compare also with Fig. 1 -
which can be seen to approach the asymptotic solution (26)-(33). The solutions are robust in the
sense that for all $m$, $\eta$, $L$ there exists a large region of $\lambda_0$ leading to convergence according to (33).
Notice, however, that in the pre-asymptotic region similar phenomena to the non-structured data
case seem to take place: the flow is divided by a separatrix defined by a $\lambda_{0,c}$ (the MC simulation
presents the same effect, see Fig. 1).

Figure 2: Asymptotic solution (26)-(33) for $L = 10$; $a/\lambda_0$, $b$ and $c$ as functions of $\lambda_0$ for two
values of the overlap, $\eta = 0.28, 0.6$. There are generally two solutions $a$, $\pm b$, $c$, with $b$ practically
independent of $\eta$. Note that there is no solution for $\lambda_0 < \lambda_0^{\min} \simeq 0.2$.



We have thus shown that the simple decrease of $\lambda$ as $1/\sqrt{\alpha}$ provides convergence to asymptotically
perfect generalization with the power $-1/2$. Alternatively, one can decrease $\lambda$ as $1/\sqrt{Q}$ or as
$e(\alpha)$, where



e{a) = — 
a 



aN/L 
g=l 



(34) 



using the running "observed error" $e_q$, eq. (2) (this is in a sense the most natural choice, since the
student only uses its own observations). Again the algorithm is stable against noise or a further
dilution of the information introduced by taking $w < 1$; see Fig. 1.



4 Summary and Discussion 

In the present paper we have investigated the performance of the AR-Hebb-algorithm introduced
in [?] in the case where the input patterns are structured. The pattern statistics is characterized
by the anisotropy vector $mC$, and the performance of the learning rule depends on $m$ and on the
overlap $\eta$ between the anisotropy vector and the vector $B$ that defines the rule - apart from the
parameters $\lambda$ and $L$ which characterize the AR-Hebb-algorithm.

Figure 3: Solution of the flow equations (17)-(19) for $L = 10$, starting point $\sqrt{Q(0)} = 100$ and
various overlaps $\eta$: flow with $\alpha$ in the $\epsilon_g$-$\sqrt{Q}$ plane (upper left), $\epsilon_g$ vs $\alpha$ (upper right), $\omega$ vs $\alpha$
(lower left) and $\sqrt{Q}$ vs $\alpha$ (lower right). Note the change in behaviour between $\lambda_0 = 3.9$ and
$\lambda_0 = 4.1$; compare with Fig. 1.



As for usual Hebb learning, a tuning of learning parameters is required to achieve good
generalization for the classification of structured patterns. Given $L$, the only free parameter of
the algorithm is $\lambda$, and tuning of $\lambda$ may proceed in various ways. For instance, one may scale $\lambda$
either with $\alpha$, i.e., with the number of input-output pairs presented, or with the self-overlap $Q$, or
with the empirical error rate $e_q$. Our analysis reveals that the scaling $\lambda \sim \alpha^{-1/2}$, which according
to that analysis is equivalent to the scalings $\lambda \sim Q^{-1/2}$ or $\lambda \sim e(\alpha)$, leads to asymptotically
perfect generalization. The behaviour is robust in the sense that the prefactor $\lambda_0$ may be varied
over a wide range without changing the asymptotic scaling of the generalization error. In this
sense the tuning required to obtain a working algorithm is not fine-tuning. The only requirement
for obtaining good generalization is that $\lambda_0$ in (26) exceeds a certain minimum value, $\lambda_0^{\min}$.
This behaviour is reminiscent of the fact that a minimum value of $\lambda$ was also required in the case
of unstructured data. In that case, however, the reason was entirely related to pre-asymptotic
behaviour governed by the fixed-point structure of the flow equations, whereas the above analysis
is restricted to the asymptotic domain.

From the numerical solution of the full flow equations and from simulations, we see some
empirical evidence that a non-trivial fixed point structure governing the pre-asymptotic behaviour,
in analogy to what has been found in [?], is present also in the case studied here. As the present
dynamical problem is three-dimensional instead of two-dimensional, however, the consequences of
this might be suspected to be less severe. For instance, a fixed point with stable and unstable
directions in three dimensions does not necessarily produce a separatrix as in the two-dimensional
case. However, the projection onto the $\epsilon_g$-$\sqrt{Q}$ plane shows a separatrix and hence a $\lambda_{0,c}$, as in
the unstructured data case (see Fig. 3), with $\lambda_{0,c} > \lambda_0^{\min}$. Unlike in the two-dimensional case
with unstructured patterns, we have so far not found any evidence of non-universal behaviour
of the generalization curve. Whether this is intrinsically related to the different role fixed points
appear to play in the present case, we do not know at present. Exceptional behaviour appears
for $\eta = 0$, which in the student-teacher scenario, however, is equivalent to the unstructured case.

Neural Networks' in Dresden, March 1999. The authors would like to thank the Max Planck
Institut für Physik Komplexer Systeme in Dresden for hospitality and financial support and the
participants of the workshop for interesting discussions.



5 Appendix 

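5.1 Expectation values

The expectation values $A_{\cdot\cdot}$, $S_{JT}$ in (17)-(19) are:

$$ A_{JT} = m \eta\, \varphi\Big( m \frac{\mathcal{D}}{\sqrt{Q}} \Big) + \sqrt{\frac{2}{\pi}}\, \frac{\mathcal{R}}{\sqrt{Q}}\, e^{-\frac{m^2 \mathcal{D}^2}{2Q}}
   = m \cos z\, \varphi(m \cos y) + \sqrt{\frac{2}{\pi}}\, \cos x\, e^{-\frac{1}{2} m^2 \cos^2 y} , \qquad (35) $$

$$ A_{TT} = m \eta\, \varphi(m \eta) + \sqrt{\frac{2}{\pi}}\, e^{-\frac{m^2 \eta^2}{2}}
   = m \cos z\, \varphi(m \cos z) + \sqrt{\frac{2}{\pi}}\, e^{-\frac{1}{2} m^2 \cos^2 z} , \qquad (36) $$

$$ S_{JT} = 1 + \varphi\Big( m \frac{\mathcal{D}}{\sqrt{Q}} \Big) - \varphi(m \eta) - 4\, G\Big( \frac{\mathcal{R}}{\sqrt{Q}}, \frac{\mathcal{D}}{\sqrt{Q}}, \eta \Big)
   = 1 + \varphi(m \cos y) - \varphi(m \cos z) - 4\, G(\cos x, \cos y, \cos z) , \qquad (37) $$

$$ A_{JJ} = m \frac{\mathcal{D}}{\sqrt{Q}}\, \varphi\Big( m \frac{\mathcal{D}}{\sqrt{Q}} \Big) + \sqrt{\frac{2}{\pi}}\, e^{-\frac{m^2 \mathcal{D}^2}{2Q}}
   = m \cos y\, \varphi(m \cos y) + \sqrt{\frac{2}{\pi}}\, e^{-\frac{1}{2} m^2 \cos^2 y} , \qquad (38) $$

$$ A_{TJ} = m \frac{\mathcal{D}}{\sqrt{Q}}\, \varphi(m \eta) + \sqrt{\frac{2}{\pi}}\, \frac{\mathcal{R}}{\sqrt{Q}}\, e^{-\frac{m^2 \eta^2}{2}}
   = m \cos y\, \varphi(m \cos z) + \sqrt{\frac{2}{\pi}}\, \cos x\, e^{-\frac{1}{2} m^2 \cos^2 z} , \qquad (39) $$

$$ A_{JC} = m\, \varphi\Big( m \frac{\mathcal{D}}{\sqrt{Q}} \Big) + \sqrt{\frac{2}{\pi}}\, \frac{\mathcal{D}}{\sqrt{Q}}\, e^{-\frac{m^2 \mathcal{D}^2}{2Q}}
   = m\, \varphi(m \cos y) + \sqrt{\frac{2}{\pi}}\, \cos y\, e^{-\frac{1}{2} m^2 \cos^2 y} , \qquad (40) $$

$$ A_{TC} = m\, \varphi(m \eta) + \sqrt{\frac{2}{\pi}}\, \eta\, e^{-\frac{m^2 \eta^2}{2}}
   = m\, \varphi(m \cos z) + \sqrt{\frac{2}{\pi}}\, \cos z\, e^{-\frac{1}{2} m^2 \cos^2 z} , \qquad (41) $$

where

$$ G(\cos x, \cos y, \cos z) = \frac{1}{2} \int_{-\infty}^{m \cos y} \frac{dt}{\sqrt{2\pi}}\, e^{-\frac{t^2}{2}} \Big[ 1 + \varphi\Big( \frac{t \cos x - m \cos z}{\sin x} \Big) \Big] , \qquad (42) $$

$$ \varphi(x) = \mathrm{erf}(x/\sqrt{2}) \qquad (43) $$

with erf the error function.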
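
Several of these closed forms are easy to evaluate numerically. The sketch below codes the versions written in terms of the angles $x$, $y$, $z$, using only `math.erf` from the Python standard library; the function names are hypothetical, and $S_{JT}$ is left out because it requires a numerical quadrature for $G$.

```python
import math

SQRT_2_OVER_PI = math.sqrt(2.0 / math.pi)

def phi(x):
    """phi(x) = erf(x / sqrt(2))."""
    return math.erf(x / math.sqrt(2.0))

def a_jt(m, x, y, z):
    mu = m * math.cos(y)
    return m * math.cos(z) * phi(mu) + SQRT_2_OVER_PI * math.cos(x) * math.exp(-0.5 * mu * mu)

def a_tt(m, z):
    mu = m * math.cos(z)
    return mu * phi(mu) + SQRT_2_OVER_PI * math.exp(-0.5 * mu * mu)

def a_jj(m, y):
    mu = m * math.cos(y)
    return mu * phi(mu) + SQRT_2_OVER_PI * math.exp(-0.5 * mu * mu)

def a_jc(m, y):
    mu = m * math.cos(y)
    return m * phi(mu) + SQRT_2_OVER_PI * math.cos(y) * math.exp(-0.5 * mu * mu)
```

Two quick consistency checks: for $m = 0$ all structure disappears and `a_tt` reduces to $\sqrt{2/\pi}$, and for a perfectly aligned student ($x = 0$, $y = z$) `a_jt` coincides with `a_tt`.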

5.2 Asymptotic coefficients

The Maple expressions for the coefficients $A_\cdot$, $B_\cdot$, $C_\cdot$ in (30)-(32) are:

u = -m cos(z) \/2 
V = msin(— z) \f2 



f{vuj) = V uj erf{v uj) + 



vr 

sin(i z) erf (ti) uj 



A12 
A2 

B12 



IT 

e~" f{voj) 



-2 i±^^ + 4 (1 - erf (tx) /(t; u;)) 



^27r 

msin(i zy erf(ti) (a;^ — 1 + s^) 



vr 

„2 



sin(iz)e " u)f{vijj) 
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Co = V2ueif{u) + ^ ^ 
Ci = -(1- y)e-"V(^^w)(\/7rmerf(t/)cos(z) + V2e-"') 




(54) 



(55) 